@microsoft-github-policy-service agree
fix #284 :
// Extract and decompress the int2 values
int32_t compressed = B_compressed[compressed_block_idx * 32 + tile_idx];
int8_t decompressed[16];
decode_i2s_to_i8s(&compressed, decompressed);
Many threads will dequantize the same i2s weights. Could we add a preprocessing step that caches the dequantized result in shared memory?
Yes, blocks with the same blockN but different blockM will dequantize the same weights. However, shared memory is only accessible to threads within the same thread block. To cache the dequantized result, I think we would either have to dequantize all weights into global memory up front, or loop over M inside every block to reuse the weights, which may then require split-K to keep parallelism high. How do you think we should implement this?
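To make the "loop over M to reuse the weights" option concrete, here is a minimal host-side C++ sketch, not the kernel in this PR. It decodes each packed weight word once and reuses the decoded tile across every row of M, which is the role shared memory would play inside one thread block. The names `decode_i2_simple` and `gemm_i2_reuse` are hypothetical, and the packing (16 two-bit codes per 32-bit word, lowest bits first, code `c` mapped to weight `c - 1`) is an assumption for illustration. It is not the layout that `decode_i2s_to_i8s` uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoder (assumed layout, NOT the PR's decode_i2s_to_i8s):
// 16 two-bit codes per word, lowest bits first, code c -> weight c - 1.
inline void decode_i2_simple(uint32_t packed, int8_t out[16]) {
    for (int i = 0; i < 16; ++i) {
        const int code = static_cast<int>((packed >> (2 * i)) & 0x3u);
        out[i] = static_cast<int8_t>(code - 1);
    }
}

// Computes C[m][n] = sum_k A[m][k] * W[n][k], with W stored packed as
// N rows of K/16 words. Each packed word is decoded exactly once and the
// decoded tile is reused for all M rows. Returns the number of decode calls.
inline std::size_t gemm_i2_reuse(const std::vector<int8_t>& A,
                                 const std::vector<uint32_t>& B_packed,
                                 std::vector<int32_t>& C,
                                 int M, int N, int K) {
    assert(K % 16 == 0);
    const int words = K / 16;
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B_packed.size() == static_cast<std::size_t>(N) * words);

    C.assign(static_cast<std::size_t>(M) * N, 0);
    std::size_t decodes = 0;

    for (int n = 0; n < N; ++n) {
        for (int w = 0; w < words; ++w) {
            // Decode once; on the GPU this tile would sit in shared memory.
            int8_t tile[16];
            decode_i2_simple(B_packed[n * words + w], tile);
            ++decodes;

            // Reuse the decoded tile for every row of M.
            for (int m = 0; m < M; ++m) {
                int32_t acc = 0;
                for (int i = 0; i < 16; ++i) {
                    acc += static_cast<int32_t>(A[m * K + w * 16 + i]) * tile[i];
                }
                C[m * N + n] += acc;
            }
        }
    }
    return decodes;
}
```

With this loop order the decode count is `N * K / 16` regardless of M, whereas decoding per (blockM, blockN) pair scales with the number of M blocks. The cost is that the only parallelism left is over N and K, which is where split-K would come in.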
…LOCK_SIZE_N adjustable
Add a GEMM kernel for int2 weights. Also fix scaling problems in the previous bitlinear kernel.